Evaluation of BMI Prediction Models Among Patients Who Have Undergone a Stroke Evaluation
Preliminary information
Because patients who are monitored for stroke status during hospital visits are often observed closely and undergo a myriad of tests, they may have exceptionally rich EHR data and be useful clinical research subjects. The purpose of this study is to use EHR data from patients who were evaluated for stroke to create a useful model for predicting BMI.
1 Setup and Data Ingest
1.1 Setup
The packages and options set in this code chunk are used to generate a legible report.
library(knitr)
library(rmdformats)
library(rmarkdown)
options(max.print="100")
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
options(width = 70)
1.2 Loading Necessary R Packages
Each of the following packages was necessary for the analysis and/or graphical display of information used in this project.
setwd(dirname(getwd()))
source("Love-boost.R")
library(janitor)
library(Epi)
library(kableExtra)
library(GGally)
library(ggforce)
library(ggstance)
library(modelsummary)
library(ggdist)
library(gghalves)
library(ggmosaic)
library(car)
library(tidyr)
library(magrittr)
library(mosaic)
library(equatiomatic)
library(simputation)
library(patchwork)
library(broom)
library(naniar)
library(tidyverse)
theme_set(theme_bw())
1.3 Data ingest
The following code is present for the purpose of ingesting the raw
data, which is a .csv file. Additionally, data types were standardized
in this step by converting necessary character variables into factors
and retaining data for bmi as numeric. The id
variable is maintained as a character variable here as it is just a
unique numeric identifier for each patient. It will only be used for
identification purposes in subsequent analyses, not as a quantitative or
categorical variable that affects the outcome of interest.
stroke_raw <- read.csv("STROKEDATA.csv") |>
mutate(across(where(is.character), as.factor)) |>
mutate(id = as.character(id)) |>
mutate(across(where(is.integer), as.factor)) |>
mutate(bmi = as.character(bmi)) |>
mutate(bmi = as.numeric(bmi))
2 Cleaning the Data
The following code is present for the purpose of creating a
preliminary data frame in the form of a tibble that contains complete
cases of each of the variables being analyzed in this project. It should
be noted that only employed individuals are being considered in this
study, and for that reason, two categories are omitted from the
work type variable. This code also renames the components
of the stroke variable and re-levels the components of the
smoking_status variable in order to have more communicable
results in subsequent analyses. Finally, a brief summary of the analytic
tibble with complete cases, named stroke_complete_cases is
provided.
stroke_raw <- stroke_raw |>
filter(complete.cases(work_type)) |>
filter(work_type == "Govt_job" |
work_type == "Private" |
work_type == "Self-employed") |>
droplevels()
stroke_raw$smoking_status <- recode_factor(stroke_raw$smoking_status,
"Unknown" = "N/A")
stroke_complete_cases <- stroke_raw |>
filter(complete.cases(id,
age,
work_type,
Residence_type,
avg_glucose_level,
bmi,
smoking_status, stroke),
bmi != "N/A", smoking_status != "N/A", gender != "Other")
stroke_complete_cases$stroke <- recode_factor(stroke_complete_cases$stroke,
"0" = "No Stroke",
"1" = "Stroke")
stroke_complete_cases <- stroke_complete_cases |>
mutate(smoking_status = fct_relevel(smoking_status,
"never smoked",
"formerly smoked",
"smokes")) |>
droplevels()
stroke_complete_cases <- stroke_complete_cases |>
as_tibble()
glimpse(stroke_complete_cases)
Rows: 3,343
Columns: 12
$ id <chr> "9046", "31112", "60182", "1665", "56669",…
$ gender <fct> Male, Male, Female, Female, Male, Male, Fe…
$ age <dbl> 67, 80, 49, 79, 81, 74, 69, 81, 61, 54, 79…
$ hypertension <fct> 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, …
$ heart_disease <fct> 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, …
$ ever_married <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes…
$ work_type <fct> Private, Private, Private, Self-employed, …
$ Residence_type <fct> Urban, Rural, Urban, Rural, Urban, Rural, …
$ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70…
$ bmi <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, …
$ smoking_status <fct> formerly smoked, never smoked, smokes, nev…
$ stroke <fct> Stroke, Stroke, Stroke, Stroke, Stroke, St…
Checking to make sure that none of the values for
bmi are unrealistically high or low:
The following code is present for the purpose of displaying the
number of bmi observations above 50 (10 points above the
clinical level of morbid obesity as per CDC guidance) and below
10 (8 points below the clinical underweight cutoff point as per
the CDC). This allows for any abnormally large or small values in the
data set to be noted so that they can be omitted.
Checking the high end of the BMI variable:
sum(stroke_complete_cases$bmi > 50)
[1] 58
Checking the low end of the BMI variable:
sum(stroke_complete_cases$bmi < 10)
[1] 0
Although the lowest bmi value in the
stroke_complete_cases data frame appears to be in a
reasonable area for a low BMI per CDC guidance, there are a total of 58
recorded BMI values over 50, which is considerably above the widely
accepted cutoff value for morbid obesity of 40 (as per CDC guidance).
For this reason values of bmi over 50 will be removed from
the data frame, as they are either serious anomalies or possibly
improperly measured/reported values. The following code removes these
values and rechecks the number of abnormal values:
stroke_complete_cases <- stroke_complete_cases[which(stroke_complete_cases$bmi < 50),]
sum(stroke_complete_cases$bmi > 50)
[1] 0
sum(stroke_complete_cases$bmi < 10)
[1] 0
2.1 Checking the key outcome (BMI) and key predictor (Average Blood Glucose Level)
The following code is present for the purpose of displaying a numeric
summary for the key outcome of the study (bmi) and the key
predictor of the study (avg_glucose_level).
df_stats(~bmi + avg_glucose_level, data = stroke_complete_cases) |>
kbl(caption =
"Numerical Summary of Key Outcome
(BMI) and Key Predictor (Average Blood Glucose Level)", digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
| response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| bmi | 11.50 | 25.30 | 29.10 | 33.90 | 49.90 | 30.00 | 6.38 | 3285 | 0 |
| avg_glucose_level | 55.12 | 77.19 | 92.27 | 115.98 | 271.74 | 108.23 | 47.73 | 3285 | 0 |
As expected, there are no missing values for either of the variables above. Additionally, all values for BMI fall between 10 and 50, and the entire interquartile range of average blood glucose level lies below 140 mg/dL, demonstrating that neither of these variables contains abnormal values by the clinical standards reported by the CDC.
2.2 Checking the Quantitative Predictors:
For this analysis, there is only one quantitative predictor that is
not the key predictor (avg_glucose_level), which is
age. The following code provides a numeric summary of
age.
df_stats(~age, data = stroke_complete_cases) |>
kbl(caption = "Numerical summary of quantitative predictor (Age)",
digits = 2) |> kable_styling(font_size = 20) |>
kable_paper("hover",
full_width = F)
| response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| age | 13 | 35 | 50 | 64 | 82 | 49.59 | 18.29 | 3285 | 0 |
It can be seen above that a wide range of ages will be included in this study, including ages as low as 13 and as high as 82. The average and median ages appear to be around 50.
2.3 Checking the Categorical Variables:
In this section, numeric summaries of each of the categorical
variables are provided individually. Each of these variables will be
used as a predictor for the key outcome, bmi.
2.3.1 Residence type
The following code provides a numeric summary of the
Residence_type variable:
stroke_complete_cases |>
tabyl(Residence_type) |>
kbl(caption = "Summary of Residence Type", digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
| Residence_type | n | percent |
|---|---|---|
| Rural | 1613 | 0.49 |
| Urban | 1672 | 0.51 |
It can be seen in the above table that there is nearly a 50/50 split between people included in this study that live in a rural environment and people included in this study that live in an urban environment.
2.3.2 Stroke
The following code provides a numeric summary of the
stroke variable:
stroke_complete_cases |> tabyl(stroke) |>
kbl(caption = "Summary of Stroke Status",digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover",
full_width = F)
| stroke | n | percent |
|---|---|---|
| No Stroke | 3106 | 0.95 |
| Stroke | 179 | 0.05 |
The above table demonstrates that only about five percent of the people included in this study had suffered a stroke in their lifetime. It should be noted that everybody in this study was evaluated for a stroke, however.
2.3.3 Work Type
The following code provides a numeric summary of the
work_type variable:
stroke_complete_cases |>
tabyl(work_type) |>
kbl(caption = "Summary of Employment Category",digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
| work_type | n | percent |
|---|---|---|
| Govt_job | 505 | 0.15 |
| Private | 2159 | 0.66 |
| Self-employed | 621 | 0.19 |
The above table demonstrates that most people included in this study are privately employed. Around 15% of participants are government employees and around 19% are self-employed.
2.3.4 Smoking Status
The following code provides a numeric summary of the
smoking_status variable:
stroke_complete_cases |>
tabyl(smoking_status) |>
kbl(caption = "Summary of Smoking Status", digits = 2) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
| smoking_status | n | percent |
|---|---|---|
| never smoked | 1751 | 0.53 |
| formerly smoked | 811 | 0.25 |
| smokes | 723 | 0.22 |
The table above demonstrates that just over half of the participants in this study have never smoked, while the remaining portion is split fairly evenly between former smokers (25%) and current smokers (22%), with somewhat more former smokers than current smokers.
2.4 Evaluation of missing cases
The following code is present for the purpose of consolidating
information on all missing values in one place, such that it can be
determined that the stroke_complete_cases tibble can be
considered to be made up of only complete cases. It should be noted that
any missing cases are considered to have been missing completely at
random (MCAR).
miss_var_summary(stroke_complete_cases)
# A tibble: 12 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 id 0 0
2 gender 0 0
3 age 0 0
4 hypertension 0 0
5 heart_disease 0 0
6 ever_married 0 0
7 work_type 0 0
8 Residence_type 0 0
9 avg_glucose_level 0 0
10 bmi 0 0
11 smoking_status 0 0
12 stroke 0 0
2.5 Printing the Analytic Tibble
The following code renames the data set containing the eight variables used in this analysis and prints it to show the final analytic tibble that will be used in all subsequent analyses.
Stroke_Analysis_Data <- stroke_complete_cases |>
select(id, age, bmi,
avg_glucose_level, work_type, Residence_type,
smoking_status, stroke)
Stroke_Analysis_Data
# A tibble: 3,285 × 8
id age bmi avg_glucose_level work_t…¹ Resid…² smoki…³ stroke
<chr> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 9046 67 36.6 229. Private Urban former… Stroke
2 31112 80 32.5 106. Private Rural never … Stroke
3 60182 49 34.4 171. Private Urban smokes Stroke
4 1665 79 24 174. Self-em… Rural never … Stroke
5 56669 81 29 186. Private Urban former… Stroke
6 53882 74 27.4 70.1 Private Rural never … Stroke
7 10434 69 22.8 94.4 Private Urban never … Stroke
8 12109 81 29.7 80.4 Private Rural never … Stroke
9 12095 61 36.8 120. Govt_job Rural smokes Stroke
10 12175 54 27.3 105. Private Urban smokes Stroke
# … with 3,275 more rows, and abbreviated variable names ¹work_type,
# ²Residence_type, ³smoking_status
3 Codebook and Data Description
3.1 Description of the subjects of this study
The data used for this study were taken from the electronic health records (EHRs) of 3285 persons aged 13-82 who, within their EHR, had explicitly been noted as either having suffered a stroke in their life or not. All EHR data used in this study were de-identified and made public by McKinsey and Company, a consulting firm with oversight over a number of US hospital EHRs. All subjects had complete data for each of the variables listed below.
3.2 Codebook
Description of Variables
The following table contains the name of each variable in the
Stroke_Analysis_Data data set used for this analysis, the
type of variable it is, and a brief description of the levels for each
variable. The variable types are denoted as such:
ID = Identification number
Quant = Quantitative variables
Binary = Two-category variables
X-Cat = Multi-category variables, where X indicates the number of levels in the variable.
| Variable | Type | Description / Levels |
|---|---|---|
| id | ID | A unique identification number assigned to each participant |
| bmi | Quant | Outcome Variable: Current body mass index of the patient as indicated by that patient’s EHR. Body mass index is calculated by dividing weight in pounds (lb) by height in inches (in) squared and multiplying that value by a conversion factor of 703. |
| work_type | 3-Cat | The category of work of the patient as indicated by that patient’s EHR. “Govt_job” indicates that the patient works for the government (public sector), “Private” indicates that the patient is employed and works for a private company (private sector), and “Self-employed” indicates that the patient is self-employed. |
| stroke | Binary | Whether or not electronic medical records indicated that the individual had suffered a stroke at some point in their life. “Stroke” indicates that they suffered a stroke; “No Stroke” indicates that they did not. |
| smoking_status | 3-Cat | The smoking status of the patient as indicated by that patient’s EHR. “never smoked” indicates that the patient has no history of smoking, “formerly smoked” indicates that they have smoked but no longer do, and “smokes” indicates that the patient currently smokes. |
| Residence_type | Binary | The residence type in which the patient lives. “Urban” refers to an urban environment and “Rural” refers to a rural environment. |
| age | Quant | The age of the patient in years as indicated in that patient’s EHR. |
| avg_glucose_level | Quant | Key Predictor: The average blood glucose level of the patient, measured in milligrams per deciliter (mg/dL), as indicated in that patient’s EHR. |
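The imperial BMI formula described in the codebook above can be sketched as a small helper function. This is an illustrative example only, not part of the analysis code, and the example weight and height are hypothetical.

```r
# Imperial BMI, as described in the codebook: weight in pounds divided
# by height in inches squared, multiplied by a conversion factor of 703.
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

bmi_imperial(154, 68)   # a 154 lb, 68 in patient: about 23.4
```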
3.3 Analytic Tibble
The following code is present for the purpose of printing the analytic tibble in order to show that it is, in fact, a tibble.
Stroke_Analysis_Data
# A tibble: 3,285 × 8
id age bmi avg_glucose_level work_t…¹ Resid…² smoki…³ stroke
<chr> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 9046 67 36.6 229. Private Urban former… Stroke
2 31112 80 32.5 106. Private Rural never … Stroke
3 60182 49 34.4 171. Private Urban smokes Stroke
4 1665 79 24 174. Self-em… Rural never … Stroke
5 56669 81 29 186. Private Urban former… Stroke
6 53882 74 27.4 70.1 Private Rural never … Stroke
7 10434 69 22.8 94.4 Private Urban never … Stroke
8 12109 81 29.7 80.4 Private Rural never … Stroke
9 12095 61 36.8 120. Govt_job Rural smokes Stroke
10 12175 54 27.3 105. Private Urban smokes Stroke
# … with 3,275 more rows, and abbreviated variable names ¹work_type,
# ²Residence_type, ³smoking_status
3.4 Numerical Data Description:
The following code is present for the purpose of displaying a brief
numeric description of each of the variables used in the
Stroke_Analysis_Data data set.
Hmisc::describe(Stroke_Analysis_Data)
Stroke_Analysis_Data
8 Variables 3285 Observations
----------------------------------------------------------------------
id
n missing distinct
3285 0 3285
lowest : 10056 10119 10133 10138 10139, highest: 9730 9752 9923 9926 9955
----------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05
3285 0 70 1 49.59 21.08 20
.10 .25 .50 .75 .90 .95
24 35 50 64 76 79
lowest : 13 14 15 16 17, highest: 78 79 80 81 82
----------------------------------------------------------------------
bmi
n missing distinct Info Mean Gmd .05
3285 0 318 1 30 7.144 21.00
.10 .25 .50 .75 .90 .95
22.40 25.30 29.10 33.90 39.06 42.20
lowest : 11.5 14.1 15.0 15.7 16.0, highest: 49.3 49.4 49.5 49.8 49.9
----------------------------------------------------------------------
avg_glucose_level
n missing distinct Info Mean Gmd .05
3285 0 2806 1 108.2 48.15 60.69
.10 .25 .50 .75 .90 .95
65.85 77.19 92.27 115.98 199.32 220.19
lowest : 55.12 55.22 55.25 55.27 55.32
highest: 266.59 267.60 267.61 267.76 271.74
----------------------------------------------------------------------
work_type
n missing distinct
3285 0 3
Value Govt_job Private Self-employed
Frequency 505 2159 621
Proportion 0.154 0.657 0.189
----------------------------------------------------------------------
Residence_type
n missing distinct
3285 0 2
Value Rural Urban
Frequency 1613 1672
Proportion 0.491 0.509
----------------------------------------------------------------------
smoking_status
n missing distinct
3285 0 3
Value never smoked formerly smoked smokes
Frequency 1751 811 723
Proportion 0.533 0.247 0.220
----------------------------------------------------------------------
stroke
n missing distinct
3285 0 2
Value No Stroke Stroke
Frequency 3106 179
Proportion 0.946 0.054
----------------------------------------------------------------------
4 My Research Question
Body mass index (BMI) is a widely utilized method of reporting the relationship between the height and weight of a person. Values for BMI under 20 and over 25 have been associated with higher all-cause mortality, with risk increasing as values move further from the 20–25 range. Given this association between BMI and all-cause mortality risk, models that can accurately predict BMI may be useful for identifying at-risk patients who may need additional care and resources in the future. Average blood glucose level is an easy-to-measure metric that can be obtained at nearly any clinic, so it is worth investigating whether it can be used to predict BMI.
My research question is as follows:
How effectively can the body mass index of a hospital patient who has been evaluated for a stroke be predicted by the average blood glucose level of the patient that was recorded in their EHR and how is the quality of prediction changed when adjusting for age, employment type, residence type, stroke history, and smoking status?
5 Partitioning the Data
The following code is present for the purpose of randomly segregating
the data from Stroke_Analysis_Data into a “training” data
frame containing 70% of the observations called
Stroke_Analysis_Train, and a second “test” data frame
containing the remaining 30% of the observations, called
Stroke_Analysis_Test.
set.seed(538)
Stroke_Analysis_Train <- Stroke_Analysis_Data |>
slice_sample(prop = .70)
Stroke_Analysis_Test <- anti_join(Stroke_Analysis_Data,
Stroke_Analysis_Train,
by = "id")
dim(Stroke_Analysis_Data)
[1] 3285    8
dim(Stroke_Analysis_Train)
[1] 2299    8
dim(Stroke_Analysis_Test)
[1] 986   8
It can be seen above by the dimensions of the
Stroke_Analysis_Data, Stroke_Analysis_Train,
and Stroke_Analysis_Test data frames respectively, that
each data frame has eight variables, and that 70% of the number of
observations in Stroke_Analysis_Data are included in the
Stroke_Analysis_Train data frame. 30% of the number of
observations from Stroke_Analysis_Data are included in the
Stroke_Analysis_Test data frame.
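As a sanity check on this style of partition, the anti_join() on id guarantees that the two pieces are disjoint and jointly exhaustive. The sketch below illustrates that property on a small hypothetical tibble (toy is not the study data):

```r
library(dplyr)

# Hypothetical 10-row tibble standing in for Stroke_Analysis_Data
toy <- tibble::tibble(id = as.character(1:10), x = rnorm(10))

set.seed(538)
toy_train <- slice_sample(toy, prop = 0.70)       # 70% training split
toy_test  <- anti_join(toy, toy_train, by = "id") # remaining 30%

# No id appears in both pieces, and the sizes sum to the whole
length(intersect(toy_train$id, toy_test$id))   # 0
nrow(toy_train) + nrow(toy_test)               # 10
```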
6 Transforming The Outcome
In this section, the properties of the training data set
(Stroke_Analysis_Train) will be assessed for the purpose of
determining whether or not a transformation is necessary before creating
a linear regression model using the data.
6.1 Visualization of
Stroke_Analysis_Train with no applied transformations
The following code creates a histogram of the distribution of BMI
values across the Stroke_Analysis_Train data.
ggplot(Stroke_Analysis_Train, aes(x = bmi)) +
geom_histogram(bins = 75, fill = "lightsalmon", color = "coral2") +
labs(title = "Histogram of BMI Values",
x = "BMI") + theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
theme(plot.background = element_rect(fill = "azure2"))
The above histogram shows that the distribution of BMI within the Stroke_Analysis_Train data set appears to be fairly normal with some right skew.
The following code is present for the purpose of creating a box and
violin plot juxtaposed with normal QQ plots for bmi,
allowing for further visualization of the distribution of values in the
Stroke_Analysis_Train data set.
p1 <- ggplot(Stroke_Analysis_Train, aes(x ="",
y = bmi)) +
geom_violin(col = "coral2",
fill = "lightsalmon",
alpha = 0.3) +
geom_boxplot(width = 0.3, notch = TRUE,
color = "coral2", fill = "azure2") +
guides(fill = "none") +
labs(title = "BMI Values",
subtitle = "For the `Stroke_Analysis_Train` data set.",
x = "`Stroke_Analysis_Train` data set",
y = "Body Mass Index") +
theme(plot.background = element_rect(fill = "azure2"))
p2 <- ggplot(Stroke_Analysis_Train,
aes(sample = bmi)) +
geom_qq(col = "coral2") +
geom_qq_line(col = "darkturquoise") +
theme_bw() +
labs(y = "BMI ",
title = "Normal QQ Plot",
subtitle = "For the `Stroke_Analysis_Train` data set.") +
theme(plot.background = element_rect(fill = "azure2"))
(p1 + theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid"))) + (p2 + theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")))
The violin plot above demonstrates a distribution that appears to be fairly normal with some slight skew, a median BMI just under 30, and an interquartile range spanning BMI values from slightly above 25 to slightly below 35. The normal QQ plot is consistent with the histogram and violin plot: a mostly normal distribution with some right skew.
Skew Assessment:
To ensure that the right skew noticed above is not severe enough to
require a transformation for further analysis of this data, a numerical
skew assessment was completed. In this analysis, nonparametric skew is
calculated by taking the difference between the mean and the median and
dividing it by the standard deviation. This measure is based on
Pearson’s notion of median skewness, and its values fall between -1 and
+1 for any distribution. Generally, when this measure of skew is used,
values greater than +0.2 indicate fairly substantial right skew, while
values below -0.2 indicate fairly substantial left skew. If the skewness
value for bmi is between -0.2 and +0.2, I will consider the
data to be approximately normally distributed.
Stroke_Analysis_Train |>
summarize(skew = round_half_up((mean(bmi) - median(bmi))/sd(bmi), 3)) |>
kbl(caption = " Numerical Skew Assessment for BMI values", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover",
full_width = F)
| skew |
|---|
| 0.158 |
The difference between the mean and the median, divided by the
standard deviation, yields a skewness value of 0.158, which is not large
enough to indicate substantial right skew in the
Stroke_Analysis_Train data set. For this reason, the
distribution of BMI in this set will be considered approximately
normal.
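As a self-contained illustration of how this skew statistic behaves (on hypothetical simulated samples, not the study data), a strongly right-skewed distribution lands above the +0.2 threshold while a symmetric one sits near zero:

```r
# Nonparametric skew: (mean - median) / sd, bounded between -1 and +1
np_skew <- function(x) (mean(x) - median(x)) / sd(x)

set.seed(538)
np_skew(rexp(10000))    # exponential sample: roughly 0.3, above +0.2
np_skew(rnorm(10000))   # normal sample: near 0, within (-0.2, +0.2)
```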
6.2 Numerical Summary of
the outcome of interest (BMI) for
Stroke_Analysis_Train
The following code provides a numerical summary for the
Stroke_Analysis_Train data set, which will be used to
create a linear regression model.
favstats(~bmi, data = Stroke_Analysis_Train) |>
kbl(caption = "BMI Numerical Summary", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 14.1 | 25.4 | 29 | 33.9 | 49.9 | 30.00474 | 6.3684 | 2299 | 0 |
6.3 Numerical Summaries
of the predictors for Stroke_Analysis_Train
The following code is present for the purpose of providing a
numerical summary of the intended predictors of the linear model that
will be built using Stroke_Analysis_Train data. A total of
two quantitative and four categorical variables will be included.
Stroke_Analysis_Train |> select(-id, -bmi ) |>
mosaic::inspect()
categorical variables:
name class levels n missing
1 work_type factor 3 2299 0
2 Residence_type factor 2 2299 0
3 smoking_status factor 3 2299 0
4 stroke factor 2 2299 0
distribution
1 Private (64.8%), Self-employed (18.7%) ...
2 Urban (51.2%), Rural (48.8%)
3 never smoked (52.6%) ...
4 No Stroke (94.5%), Stroke (5.5%)
quantitative variables:
name class min Q1 median Q3 max
1 age numeric 13.00 35.00 50.00 64.00 82.00
2 avg_glucose_level numeric 55.22 77.36 91.92 116.07 271.74
mean sd n missing
1 49.66464 18.34327 2299 0
2 108.29753 47.70805 2299 0
6.4 Scatterplot Matrices
Two scatter plot matrices were created for the purpose of visualizing
the relationship between each of the predictors and the outcome of
interest being studied for Stroke_Analysis_Train using the
code below:
spm1A <- Stroke_Analysis_Train |>
select(age, avg_glucose_level, Residence_type, bmi)
spm2A <- Stroke_Analysis_Train |>
select(stroke, smoking_status, work_type, bmi)
ggpairs(spm1A, title = "Scatterplot Matrix",
lower = list(combo = wrap("facethist", bins = 20))) +
theme(plot.background = element_rect(fill = "azure2")) +
theme( panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5,
linetype = "solid")) +
labs(title = "Scatterplot Matrix",
subtitle =
"Comparison of age, average blood glucose level, residence type, and BMI")
ggpairs(spm2A, title = "Scatterplot Matrix",
lower = list(combo = wrap("facethist", bins = 20))) +
theme(plot.background = element_rect(fill = "azure2")) +
theme(panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
labs(title = "Scatterplot Matrix",
subtitle =
"Comparison of stroke status, smoking status, employment category, and BMI ")
The scatter plots between quantitative variables above demonstrate
some heteroscedasticity. The distributions of bmi and
age appear relatively normal, while the distribution of
avg_glucose_level shows some right skew.
avg_glucose_level and age both appear to have
weak positive correlations with bmi, with the relationship
between age and bmi being particularly weak
(correlation of 0.059). avg_glucose_level and
age also appear to have a weak positive relationship, with
a correlation of 0.240.
Turning to the categorical predictors, there does not appear to be a
strong relationship between any of the quantitative and categorical
predictors; consider, for example, Residence_type and
age. The following numerical summary demonstrates that the
age distribution is similar across residence types:
mosaic::favstats(age ~ Residence_type, data = Stroke_Analysis_Train) |>
kbl(caption = "Age Numerical Summary by Residence type", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)
| Residence_type | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Rural | 13 | 34 | 50 | 63 | 82 | 49.14248 | 18.35837 | 1123 | 0 |
| Urban | 13 | 36 | 50 | 65 | 82 | 50.16327 | 18.32275 | 1176 | 0 |
A similar relationship between predictors as that which is seen in
the above table was also seen for avg_glucose_level and
Residence_type, in which there was a fairly even
distribution between the average blood glucose levels of people living
in both urban and rural areas.
mosaic::favstats(avg_glucose_level ~ Residence_type, data = Stroke_Analysis_Train) |>
kbl(caption = "Average Blood Glucose Level Numerical Summary by Residence type", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover",
full_width = F)
| Residence_type | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Rural | 55.27 | 77.670 | 92.87 | 116.290 | 271.74 | 108.4020 | 47.62479 | 1123 | 0 |
| Urban | 55.22 | 76.905 | 91.20 | 116.025 | 267.61 | 108.1978 | 47.80747 | 1176 | 0 |
Because smoking has been demonstrated to lead to life-threatening medical conditions, age and smoking status were compared. However, no category of smoking was vastly different in its age distribution: minimum and maximum values were remarkably similar, and interquartile ranges did not vary immensely.
mosaic::favstats(age ~ smoking_status, data = Stroke_Analysis_Train) |>
kbl(caption = "Age Numerical Summary by Smoking Status", digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover",
full_width = F)
| smoking_status | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| never smoked | 13 | 33 | 48 | 63 | 82 | 48.11332 | 19.03359 | 1209 | 0 |
| formerly smoked | 13 | 43 | 58 | 70 | 82 | 55.66025 | 17.28137 | 571 | 0 |
| smokes | 16 | 32 | 47 | 58 | 82 | 46.68208 | 16.28111 | 519 | 0 |
6.4.1 Discussion of Collinearity
As discussed above, avg_glucose_level and
age appear to have a weak positive relationship, with a
correlation of 0.240. For the purposes of this study, a Pearson
correlation under 0.3 will be considered weak enough to rule out an
issue with collinearity between predictors. Thus, there do not appear to
be any collinearity issues that might affect the model built from the
Stroke_Analysis_Train data set.
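A complementary numeric check for collinearity is the variance inflation factor (VIF), which car::vif() reports for a fitted model. The sketch below computes it by hand on simulated data (hypothetical stand-ins, not the study variables) with roughly the observed age-glucose correlation of 0.24; with two predictors, VIF = 1/(1 - r²), so values near 1 confirm weak collinearity:

```r
set.seed(538)
n <- 500
# Simulated stand-ins with a true correlation of roughly 0.24,
# matching the observed age/glucose correlation
sim_age     <- rnorm(n, 50, 18)
sim_glucose <- 0.24 * 48 * scale(sim_age)[, 1] + rnorm(n, 108, 46)

# With two predictors, VIF = 1 / (1 - r^2) for each; car::vif()
# reports the same quantity for a fitted model
r <- cor(sim_age, sim_glucose)
1 / (1 - r^2)   # close to 1, consistent with weak collinearity
```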
6.5 Checking to see if a Transformation is Necessary:
In order to determine whether a linearizing transformation is necessary, the Box-Cox method was employed. For this analysis, a linear model that will later be described as the “Big model” was used. The purpose of constructing a Box-Cox plot is to sift through Tukey’s ladder of power transformations and determine which transformation is most suitable for linearizing the relationship between the outcome and the predictors.
par(bg = "azure2")
model_temp <- lm(bmi ~ work_type +
Residence_type +
stroke +
smoking_status +
age +
avg_glucose_level, data = Stroke_Analysis_Train)
boxCox(model_temp)
The boxCox approach is used in conjunction with Tukey’s
ladder of power transformations:
| Power (λ) | -2 | -1 | -1/2 | 0 | 1/2 | 1 | 2 |
|---|---|---|---|---|---|---|---|
| Transformation | 1/y² | 1/y | \(\frac{1}{\sqrt{y}}\) | log(y) | \(\sqrt{y}\) | y | y² |
In order to show the numeric value of λ displayed above, the
powerTransform function from the car package
was used:
powerTransform(model_temp)
Estimated transformation parameter
Y1
-0.122814
The value of λ derived from the powerTransform function
is close to zero, indicating that a natural logarithmic transformation
of bmi may be useful for this data (using log,
which computes natural logarithms in R). The following code is present
for the purpose of displaying the non-transformed data alongside a
possible logarithmic transformation, in the format of a scatter plot of
avg_glucose_level (the key predictor) against bmi
(the key outcome):
p1 <- ggplot(Stroke_Analysis_Train, aes(x = avg_glucose_level, y = log(bmi))) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) + theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
labs(title = "log(BMI) vs Avg. Glucose Level",
x = "Average Blood Glucose Level",
y = "log(BMI)" )
p2 <- ggplot(Stroke_Analysis_Train, aes(x = avg_glucose_level, y = bmi)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
geom_smooth(method = "lm", col = "red", formula = y ~ x, se = FALSE) + theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
labs(title = "BMI vs Avg. Glucose Level",
x = "Average Blood Glucose Level",
y = "BMI" )
(p1) + theme(plot.background = element_rect(fill = "azure2")) + p2 +
theme(plot.background = element_rect(fill = "azure2"))
The above comparison demonstrates that when the natural log of bmi is used, a slightly more linear relationship is observed. For this reason, the transformation identified by the Box-Cox method will be retained for this analysis.
7 The Big Model
The following code is present for the purpose of fitting a linear
model to predict the natural log of bmi given information
on avg_glucose_level, work_type,
Residence_type, stroke,
smoking_status and age. A summary of each of
the coefficients is also provided. This model is titled
Big_model to reflect that it includes many variables.
Another model, called Small_model, will also be
built that includes only one predictor (the key predictor,
avg_glucose_level).
Big_model <- lm(log(bmi) ~ avg_glucose_level +
work_type +
Residence_type +
stroke +
smoking_status +
age,
data = Stroke_Analysis_Train)
summary(Big_model)
Call:
lm(formula = log(bmi) ~ avg_glucose_level + work_type + Residence_type +
stroke + smoking_status + age, data = Stroke_Analysis_Train)
Residuals:
Min 1Q Median 3Q Max
-0.72432 -0.14660 -0.01194 0.13966 0.57687
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.286e+00 1.898e-02 173.084 < 2e-16
avg_glucose_level 5.733e-04 9.389e-05 6.106 1.2e-09
work_typePrivate -7.823e-03 1.196e-02 -0.654 0.5132
work_typeSelf-employed -1.747e-02 1.482e-02 -1.179 0.2386
Residence_typeUrban 3.990e-03 8.662e-03 0.461 0.6451
strokeStroke -1.491e-02 1.965e-02 -0.759 0.4480
smoking_statusformerly smoked 1.614e-02 1.068e-02 1.511 0.1308
smoking_statussmokes 1.722e-02 1.089e-02 1.581 0.1141
age 6.215e-04 2.654e-04 2.341 0.0193
(Intercept) ***
avg_glucose_level ***
work_typePrivate
work_typeSelf-employed
Residence_typeUrban
strokeStroke
smoking_statusformerly smoked
smoking_statussmokes
age *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2072 on 2290 degrees of freedom
Multiple R-squared: 0.02459, Adjusted R-squared: 0.02118
F-statistic: 7.216 on 8 and 2290 DF, p-value: 1.732e-09
7.1 Effect Sizes: Coefficient Estimates
The following code is present for the purpose of displaying all
coefficients included in the Big_model linear model.
Included for each coefficient are the point estimate, standard error,
90% confidence interval, and p-value.
tidy(Big_model, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
kbl(caption = "Coefficient Estimates with 90% confidence intervals",
digits = 5) |> kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 3.28562 | 0.01898 | 3.25439 | 3.31686 | 0.00000 |
| avg_glucose_level | 0.00057 | 0.00009 | 0.00042 | 0.00073 | 0.00000 |
| work_typePrivate | -0.00782 | 0.01196 | -0.02751 | 0.01186 | 0.51319 |
| work_typeSelf-employed | -0.01747 | 0.01482 | -0.04187 | 0.00692 | 0.23855 |
| Residence_typeUrban | 0.00399 | 0.00866 | -0.01026 | 0.01824 | 0.64510 |
| strokeStroke | -0.01491 | 0.01965 | -0.04725 | 0.01742 | 0.44798 |
| smoking_statusformerly smoked | 0.01614 | 0.01068 | -0.00143 | 0.03371 | 0.13084 |
| smoking_statussmokes | 0.01722 | 0.01089 | -0.00071 | 0.03515 | 0.11408 |
| age | 0.00062 | 0.00027 | 0.00018 | 0.00106 | 0.01930 |
It can be seen in the above table that none of the coefficients for
any of the variables are particularly large, suggesting that none are
strong predictors of the natural log of bmi. Furthermore,
many of the 90% confidence intervals calculated for the coefficients
include zero, so the data do not allow us to determine whether those
variables have a positive or negative association with the natural log
of bmi. Taking into account only
the point estimates, the model suggests that being self-employed,
working for a private company, and having suffered a stroke are
associated with lower ln(bmi) values, while all other
variables are positively associated with ln(bmi).
7.1.1 Visualization of Estimates
Find below a plot summary containing the 90% confidence intervals of each of the coefficients displayed in the table above. In the following plot, the intercept is not displayed as it is several orders of magnitude greater than the largest predictor coefficient.
td <- tidy(Big_model, conf.int = TRUE, conf.level = 0.90) |>
filter(term != "(Intercept)")
ggplot(td, aes(x = term,
y = estimate,
col = term)) +
geom_point(size = 2) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3) +
ylim(-0.05,0.04) + theme(legend.key=element_rect(colour="azure2")) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
labs(title ="Big Model Estimates",
subtitle = "With 90% Confidence intervals",
x="",
y="Estimate for coefficient (with 90% CI)") +
guides(x = guide_axis(angle = 90)) +
theme(panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
theme(plot.background = element_rect(fill = "azure2"))
7.2 Displaying the Equation For the Big Model
Using the point estimates calculated for each coefficient, an
equation can be written for the Big_model. The following
code displays this equation:
extract_eq(Big_model, use_coefs = TRUE, coef_digits = 5,
terms_per_line = 2, wrap = TRUE, ital_vars = TRUE)\[ \begin{aligned} \widehat{log(bmi)} &= 3.28562 + 0.00057(avg\_glucose\_level)\ - \\ &\quad 0.00782(work\_type_{Private}) - 0.01747(work\_type_{Self-employed})\ + \\ &\quad 0.00399(Residence\_type_{Urban}) - 0.01491(stroke_{Stroke})\ + \\ &\quad 0.01614(smoking\_status_{formerly\ smoked}) + 0.01722(smoking\_status_{smokes})\ + \\ &\quad 0.00062(age) \end{aligned} \]
The above equation indicates the following:
As the average blood glucose level for a patient increases by 1 mg/dl, the natural log of the BMI for that patient is predicted to increase by 0.00057.
If a patient is privately employed, the natural log of the BMI for that patient is predicted to decrease by 0.00782.
If a patient is self-employed, the natural log of the BMI for that patient is predicted to decrease by 0.01747.
If a patient lives in an urban environment as opposed to a rural environment, the natural log of the BMI for that patient is predicted to increase by 0.00399.
If a patient has suffered a stroke, the natural log of the BMI for that patient is predicted to decrease by 0.01491.
If a patient has formerly smoked or is a current smoker, that patient’s predicted natural log of BMI increases by 0.01614 or 0.01722, respectively.
For every additional year of a patient’s age, that patient’s predicted natural log of BMI increases by 0.00062.
Please note that each of the above bullet points assumes that all other variables are held constant.
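Because the outcome is modeled on the natural log scale, each coefficient can also be re-expressed as an approximate percent change in BMI itself via exponentiation. A minimal sketch, using two point estimates copied from the Big_model summary above:

```r
# Re-express a log(BMI)-scale coefficient as a percent change in BMI.
# Point estimates taken from the Big_model summary above.
beta_glucose <- 0.00057   # per 1 mg/dl increase in avg_glucose_level
beta_smokes  <- 0.01722   # current smoker vs. never smoked

pct_change <- function(beta) 100 * (exp(beta) - 1)

pct_change(beta_glucose)  # about 0.057% higher BMI per 1 mg/dl
pct_change(beta_smokes)   # about 1.74% higher BMI for current smokers
```

For coefficients this small, exp(beta) - 1 is nearly equal to beta itself, which is why the additive interpretations above are a reasonable shorthand.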
8 The Smaller Model
The following code is present for the purpose of fitting a linear
model to predict the natural log of bmi given information
only on avg_glucose_level, which is the key predictor for
this study. This model is termed Small_model, as it only
uses one variable to predict the natural log of bmi.
Small_model <- lm(log(bmi) ~ avg_glucose_level,
data = Stroke_Analysis_Train)
summary(Small_model)
Call:
lm(formula = log(bmi) ~ avg_glucose_level, data = Stroke_Analysis_Train)
Residuals:
Min 1Q Median 3Q Max
-0.71701 -0.14609 -0.00984 0.14059 0.56056
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.312e+00 1.073e-02 308.696 < 2e-16 ***
avg_glucose_level 6.176e-04 9.068e-05 6.811 1.24e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2074 on 2297 degrees of freedom
Multiple R-squared: 0.01979, Adjusted R-squared: 0.01937
F-statistic: 46.39 on 1 and 2297 DF, p-value: 1.236e-11
8.1 Effect Sizes: Coefficient Estimates
The following code is present for the purpose of displaying the
intercept point estimate as well as the point estimate for the
coefficient of the key predictor
(avg_glucose_level) in the small linear model. Included
are the standard errors, 90% confidence intervals, and p-values
for the intercept and coefficient.
tidy(Small_model, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
kbl(caption = "Coefficient Estimates with 90% confidence intervals",
digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 3.31248 | 0.01073 | 3.29483 | 3.33014 | 0 |
| avg_glucose_level | 0.00062 | 0.00009 | 0.00047 | 0.00077 | 0 |
It can be seen above that the small model estimates a very small
positive effect of avg_glucose_level on the natural log
of the bmi value. It is notable that the 90% confidence
interval is comprised of only positive values, indicating that, at
an alpha level of 0.10, the estimate is detectably positive with
the sample size used here.
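That interval can be reproduced directly from the point estimate and standard error reported by summary(Small_model), using the t distribution with the model’s residual degrees of freedom. A quick check (values copied from the output above):

```r
# Rebuild the 90% confidence interval for avg_glucose_level by hand.
est <- 6.176e-04   # point estimate from summary(Small_model)
se  <- 9.068e-05   # standard error from summary(Small_model)
df  <- 2297        # residual degrees of freedom

est + c(-1, 1) * qt(0.95, df) * se
# roughly (0.00047, 0.00077), matching the table above
```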
8.1.1 Plotting the estimate:
The following plot displays the 90% confidence interval for the
avg_glucose_level coefficient displayed in the table above.
In the following plot, the intercept for the Small_model is
not displayed as it is several orders of magnitude greater than the
predictor coefficient.
td2 <- tidy(Small_model, conf.int = TRUE, conf.level = 0.90) |>
filter(term != "(Intercept)")
ggplot(td2, aes(x = term, y = estimate, col = term)) +
geom_point(size = 2) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3) +
ylim(-0.00,0.001) + theme(legend.key=element_rect(colour="azure2")) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
labs(title ="Small Model Estimates",
subtitle = "With 90% Confidence intervals",
x="",
y="Estimate for coefficient (with 90% CI)") +
guides(x = guide_axis(angle = 90)) +
theme(panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
theme(plot.background = element_rect(fill = "azure2"))
8.2 Displaying the Equation For the Small Model
Using the point estimates calculated for the coefficient and the
intercept, an equation can be written for the Small_model.
The following code displays this equation:
extract_eq(Small_model, use_coefs = TRUE, coef_digits = 5,
terms_per_line = 2, wrap = TRUE, ital_vars = TRUE)\[ \begin{aligned} \widehat{log(bmi)} &= 3.31248 + 0.00062(avg\_glucose\_level) \end{aligned} \]
The above equation indicates the following:
As the average blood glucose level for a patient increases by 1 mg/dl, the natural log of the BMI for that patient is predicted to increase by 0.00062.
The Y intercept, or the predicted value of the natural log of BMI for a patient with a blood glucose level of 0 mg/dl, is 3.31248. While this number is useful for the purpose of plotting the model, it is practically impossible for a living person to have a blood glucose level of 0 mg/dl or a BMI below 8.
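As a worked example of how this equation back-transforms to the BMI scale, exponentiating the predicted natural log of BMI yields a predicted BMI. The glucose values below are hypothetical inputs chosen for illustration, not patients from the data:

```r
# Predict BMI (not log(BMI)) from the Small_model equation above.
predict_bmi <- function(avg_glucose_level) {
  exp(3.31248 + 0.00062 * avg_glucose_level)
}

predict_bmi(100)  # about 29.2
predict_bmi(200)  # about 31.1
```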
9 In sample comparison
9.1 Quality of fit
The following code is present for the purpose of displaying a table
that may be used to compare the Big_model to the
Small_model. This table includes information related to the
number of observations used to create each model, the degrees of
freedom, the Akaike information criterion (AIC) and the Bayesian
information criterion (BIC).
temp_a <- glance(Big_model) |>
select(-logLik) |>
round(digits = 5) |>
mutate(modelname = "Big Model")
temp_b <- glance(Small_model) |>
select(-logLik) |>
round(digits = 5) |>
mutate(modelname = "Small Model")
training_model_comparison <- bind_rows(temp_a, temp_b) |>
select(modelname, nobs, df, AIC, BIC, everything())
training_model_comparison |> kbl(caption = "Comparison of training models",
digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| modelname | nobs | df | AIC | BIC | r.squared | adj.r.squared | sigma | statistic | p.value | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Big Model | 2299 | 8 | -702.5240 | -645.1217 | 0.02459 | 0.02118 | 0.20719 | 7.21593 | 0 | 98.30501 | 2290 |
| Small Model | 2299 | 1 | -705.2513 | -688.0306 | 0.01979 | 0.01937 | 0.20738 | 46.38534 | 0 | 98.78821 | 2297 |
The above table demonstrates that both the Small_model
and the Big_model are based on the same number of
observations, which makes sense, as the
Stroke_Analysis_Train data frame was composed of only
complete cases. Although the AIC values are similar, the AIC is
actually smaller (more negative) for the Small_model,
suggesting that by this criterion the smaller model fits slightly
better once its lower complexity is taken into account. The same is
true of the BIC, which penalizes additional predictors even more
heavily than AIC and is also smaller for the
Small_model. On the other hand, the R squared
value for the Big_model was larger than that of the
Small_model, indicating that the big model explains
more of the variation seen in the natural log of
bmi in the training sample. This was to be expected, as the
Big_model contains the same variable as the
Small_model in addition to more variables, and R squared
values always favor models with more predictors. In short, the
in-sample evidence is mixed, so the out-of-sample validation in
Section 10 will carry more weight in choosing between the models.
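The reason AIC and BIC can prefer a smaller model even when R squared favors a bigger one is the complexity penalty each criterion applies. A minimal sketch of the standard formulas, with hypothetical log-likelihood values chosen for illustration (not taken from this analysis):

```r
# AIC penalizes each estimated parameter by 2; BIC by log(n).
aic <- function(loglik, k)    -2 * loglik + 2 * k
bic <- function(loglik, k, n) -2 * loglik + log(n) * k

n <- 2299                 # training observations, as in the table above
ll_small <- 360           # hypothetical log-likelihood, 3 parameters
ll_big   <- 365.6         # hypothetical log-likelihood, 10 parameters

aic(ll_small, 3) < aic(ll_big, 10)        # TRUE: smaller model wins on AIC
bic(ll_small, 3, n) < bic(ll_big, 10, n)  # TRUE: smaller model wins on BIC
```

Here the bigger model fits better (larger log-likelihood), but the improvement is not large enough to pay for the seven extra parameters, so both criteria favor the smaller model, which is the pattern in the comparison table above.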
9.2 Assessing Assumptions:
In order to exhibit the differences between observed and fitted response values for each of the models, residual plots were created.
9.2.1 Residual Plots for the Big Model
The following code is present for the purpose of displaying residual plots for the big model, including a residuals vs fitted plot, a normal qq plot for the residuals, a scale location plot for the residuals, and a residuals vs leverage plot.
par(bg = "azure2")
par(mfrow = c(2,2)); plot(Big_model); par(mfrow = c(1,1))
It can be observed above that the residuals vs. fitted plot showed no meaningful pattern, supporting linearity. Additionally, the normal Q-Q plot exhibited an approximately normal distribution among the residual values for the big model, and the scale-location plot displayed homoscedasticity, and thus constant variance. Finally, there did not appear to be any exceptionally large Cook’s distance values on the residuals vs. leverage plot, indicating that none of the data points had a seriously disproportionate influence on the model. Given the results for each of these plots, it does not appear that any of the assumptions for multiple linear regression (linearity, independence, constant variance, or normality) are violated.
9.2.2 Residual Plots for the Small Model
The following code is present for the purpose of displaying residual plots for the small model, including a residuals vs fitted plot, a normal qq plot for the residuals, a scale location plot for the residuals, and a residuals vs leverage plot.
par(bg = "azure2")
par(mfrow = c(2,2)); plot(Small_model); par(mfrow = c(1,1))
In the above residual plots for Small_model, it can be
seen that, similar to Big_model, the residuals
vs. fitted plot showed no meaningful pattern, supporting
linearity; the normal Q-Q plot exhibited an approximately normal
distribution among the residual values; and the scale-location plot
displayed homoscedasticity, and thus constant variance.
Additionally, there did not appear to be any
exceptionally large Cook’s distance values on the residuals
vs. leverage plot, indicating that none of the data points had a
seriously disproportionate influence on the model. Given the results
for each of these plots, it does not appear that any of the
assumptions for linear regression (linearity, independence,
constant variance, or normality) are violated.
9.2.3 Assessing the Collinearity of the Big Model
Due to the fact that the big model had multiple predictors, the
potential impact of collinearity should be explored. The variance
inflation factor (VIF) is commonly used to quantify the impact of
collinearity for linear models. The following code is present for the
purpose of displaying the Variance Inflation Factor using the
vif function in the car package.
car::vif(Big_model) |>
kbl(caption = "Variance Inflation Factor Analysis of the Big Model",
digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| avg_glucose_level | 1.07418 | 1 | 1.03643 |
| work_type | 1.11427 | 2 | 1.02742 |
| Residence_type | 1.00391 | 1 | 1.00195 |
| stroke | 1.07913 | 1 | 1.03881 |
| smoking_status | 1.04167 | 2 | 1.01026 |
| age | 1.26919 | 1 | 1.12659 |
The above VIF table does not exhibit any VIF values above 1.27.
Generally, VIF values greater than or equal to 5 are considered to
indicate a serious issue with multicollinearity, as this would
correspond to R squared values at or above 0.8 when regressing one
predictor on the others. No VIF value here is near this threshold,
and for this reason there are not considered to be any
notable multicollinearity issues in the Big_model.
Because the Small_model only uses one predictor, it is
not necessary to complete a multicollinearity assessment for that
model.
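The VIF cutoff mentioned above follows directly from the relationship between VIF and the R squared obtained when regressing one predictor on all the others. A minimal sketch:

```r
# VIF for a predictor is 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all of the other predictors in the model.
vif_from_r2 <- function(r2) 1 / (1 - r2)

vif_from_r2(0.8)   # 5, the conventional cutoff for serious collinearity
vif_from_r2(0.21)  # about 1.27, near the largest value in the table above
```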
9.3 Comparing the models
Based on the above information comparing the Big_model
with the Small_model, neither model appears to violate
any important regression assumptions, and the in-sample evidence on
fit quality is mixed. The Small_model showed smaller
values for both AIC and BIC, while the Big_model had a
larger R squared value, indicating that it explains more of the
variation in the natural log of bmi in the training
sample. Because these criteria disagree, the out-of-sample
validation in the next section will be used to choose between the
models.
Although both of the models were shown to have some predictive value,
it should be noted that none of the coefficients for any of the
variables used to attempt to predict bmi were very large,
and thus it should be considered that none of the variables used in
either of the models had an exceptionally strong influence on the
bmi value predicted.
10 Model Validation
In this section, the two models created in the previous section will
be used to predict BMI using data from
Stroke_Analysis_Test.
10.1 Calculating the prediction errors
10.1.1 Big Model: Back-Transformation and Calculating Fits/Residuals
Using the augment function from within the
broom package, fitted and residual values were determined
and stored as bmi_fit and bmi_res respectively
for the big model on the Stroke_Analysis_Test data. Because
a natural logarithmic transformation was used in the development of each
of the linear models, a back-transformation is necessary in order to
proceed. To back-transform the fitted values, Euler’s number
was raised to the power of each fitted value for log(bmi).
aug_big_model <- augment(Big_model, newdata = Stroke_Analysis_Test) |>
mutate(mod_name = "Big",
bmi_fit = exp(1)^(.fitted),
bmi_res = bmi - bmi_fit) |>
select(id, mod_name, bmi, bmi_fit, bmi_res, everything())
head(aug_big_model,4) |>
kbl(caption = "Residual and Fitted Values Calculated For the Big Model",
digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| id | mod_name | bmi | bmi_fit | bmi_res | age | avg_glucose_level | work_type | Residence_type | smoking_status | stroke | .fitted | .resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1665 | Big | 24.0 | 30.02899 | -6.02899 | 79 | 174.12 | Self-employed | Rural | never smoked | Stroke | 3.40216 | -0.22411 |
| 53882 | Big | 27.4 | 28.47604 | -1.07604 | 74 | 70.09 | Private | Rural | never smoked | Stroke | 3.34906 | -0.03852 |
| 58202 | Big | 30.9 | 29.37938 | 1.52062 | 50 | 167.41 | Self-employed | Rural | never smoked | Stroke | 3.38029 | 0.05046 |
| 56112 | Big | 37.5 | 30.99208 | 6.50792 | 64 | 191.61 | Private | Urban | smokes | Stroke | 3.43373 | 0.19061 |
10.1.2 Small Model: Back-Transformation and Calculating Fits/Residuals
Using the augment function from within the
broom package, fitted and residual values were determined
and stored as bmi_fit and bmi_res respectively
for the small model on the Stroke_Analysis_Test data. Once
again, a back transformation was required to remove the natural log that
had been applied to bmi values earlier.
aug_small_model <- augment(Small_model, newdata = Stroke_Analysis_Test) |>
mutate(mod_name = "Small",
bmi_fit = exp(1)^(.fitted),
bmi_res = bmi - bmi_fit) |>
select(id, mod_name, bmi, bmi_fit, bmi_res, everything())
head(aug_small_model,4) |>
kbl(caption = "Residual and Fitted Values Calculated For the Small Model",
digits = 5) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| id | mod_name | bmi | bmi_fit | bmi_res | age | avg_glucose_level | work_type | Residence_type | smoking_status | stroke | .fitted | .resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1665 | Small | 24.0 | 30.56992 | -6.56992 | 79 | 174.12 | Self-employed | Rural | never smoked | Stroke | 3.42002 | -0.24196 |
| 53882 | Small | 27.4 | 28.66765 | -1.26765 | 74 | 70.09 | Private | Rural | never smoked | Stroke | 3.35577 | -0.04523 |
| 58202 | Small | 30.9 | 30.44350 | 0.45650 | 50 | 167.41 | Self-employed | Rural | never smoked | Stroke | 3.41587 | 0.01488 |
| 56112 | Small | 37.5 | 30.90191 | 6.59809 | 64 | 191.61 | Private | Urban | smokes | Stroke | 3.43082 | 0.19352 |
10.1.3 Combining the results
The following code is present for the purpose of creating a combined
tibble containing residual and fit information from both
aug_small_model and aug_big_model.
test_comp <- union(aug_big_model, aug_small_model) |>
arrange(id, mod_name)
head(test_comp, 6) |>
kbl(caption =
"Table of Residual and Fitted Values Calculated For the Big and Small Model",
digits = 5) |>
kable_styling(font_size = 20) |> kable_paper("hover", full_width = F)| id | mod_name | bmi | bmi_fit | bmi_res | age | avg_glucose_level | work_type | Residence_type | smoking_status | stroke | .fitted | .resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10138 | Big | 24.8 | 28.97181 | -4.17181 | 41 | 74.85 | Private | Urban | formerly smoked | No Stroke | 3.36632 | -0.15548 |
| 10138 | Small | 24.8 | 28.75204 | -3.95204 | 41 | 74.85 | Private | Urban | formerly smoked | No Stroke | 3.35871 | -0.14787 |
| 10159 | Big | 31.6 | 28.91873 | 2.68127 | 41 | 99.80 | Private | Urban | never smoked | No Stroke | 3.36449 | 0.08867 |
| 10159 | Small | 31.6 | 29.19851 | 2.40149 | 41 | 99.80 | Private | Urban | never smoked | No Stroke | 3.37412 | 0.07904 |
| 10245 | Big | 35.8 | 28.39331 | 7.40669 | 54 | 77.52 | Self-employed | Rural | never smoked | No Stroke | 3.34615 | 0.23179 |
| 10245 | Small | 35.8 | 28.79949 | 7.00051 | 54 | 77.52 | Self-employed | Rural | never smoked | No Stroke | 3.36036 | 0.21759 |
10.2 Visualizing the Predictions
The following code is present for the purpose of visualizing the predicted and observed values for each model side-by-side.
ggplot(test_comp, aes(x = bmi_fit, y = bmi)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = "dashed") +
geom_smooth(method = "loess", col = "blue", se = FALSE, formula = y ~ x) +
facet_wrap( ~ mod_name, labeller = "label_both") +
labs(x = "Predicted BMI",
y = "Observed BMI",
title = "Observed vs. Predicted BMI",
subtitle = "Comparison of Big and Small Model Residual And Fitted Values",
caption = "Dashed line represents where Observed = Predicted") +
theme(
panel.background = element_rect(fill = "lemonchiffon1",
colour = "lemonchiffon1",
size = 0.5, linetype = "solid")) +
theme(plot.background = element_rect(fill = "azure2"))
It can be seen in the plots above that both the
Big_model and the Small_model predict the
values for BMI conservatively and do not predict extreme values for BMI.
It may also be seen that the general shape of both models is similar,
with the Big_model demonstrating a slightly flatter loess
smooth line. It appears from the scatter plots that the
Big_model predicted more BMI values below 28 than did the
Small_model and that neither model predicted a BMI value
above 33.
10.2.1 Summarizing the Errors
The following code is present for the purpose of displaying a table containing a summary of the prediction errors for each model. This includes the mean absolute prediction error, or MAPE and the square root of the mean squared prediction error, or RMSPE.
test_comp |>
group_by(mod_name) |>
dplyr::summarise(n = n(),
MAPE = mean(abs(bmi_res)),
RMSPE = sqrt(mean(bmi_res^2)),
max_error = max(abs(bmi_res))) |>
kbl(caption = "Error Summary", digits = 13) |>
kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| mod_name | n | MAPE | RMSPE | max_error |
|---|---|---|---|---|
| Big | 986 | 4.922390 | 30.6549 | 49.8 |
| Small | 986 | 4.947876 | 30.6549 | 49.8 |
It can be noted in the table above that the big and the small model demonstrated similar prediction errors, with the MAPE slightly smaller for the big model. This indicates some possibility that the big model is better suited to answer the research question.
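For reference, the two error summaries in the table can be written as standalone functions of the prediction errors (observed minus predicted BMI). The residual vector below is a toy example for illustration only:

```r
# Error summaries used above, defined on a vector of prediction errors.
MAPE  <- function(res) mean(abs(res))     # mean absolute prediction error
RMSPE <- function(res) sqrt(mean(res^2))  # root mean squared prediction error

res <- c(-2, 1, 3)  # toy prediction errors, for illustration only
MAPE(res)   # 2
RMSPE(res)  # about 2.16
```

Because RMSPE squares the errors before averaging, it weights large misses more heavily than MAPE does.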
10.3 Identifying the largest errors
The following code is present for the purpose of determining the
observation in which the Big_model was least able to
accurately predict the value for bmi and the observation in
which the Small_model was least able to accurately predict
the value for bmi.
temp1 <- aug_big_model |>
filter(abs(bmi_res) == max(abs(bmi_res)))
temp2 <- aug_small_model |>
filter(abs(bmi_res) == max(abs(bmi_res)))
bind_rows(temp1, temp2) |>
kbl(digits = 5) |> kable_styling(font_size = 20) |>
kable_paper("hover", full_width = F)| id | mod_name | bmi | bmi_fit | bmi_res | age | avg_glucose_level | work_type | Residence_type | smoking_status | stroke | .fitted | .resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59872 | Big | 49.3 | 28.43877 | 20.86123 | 38 | 80.82 | Private | Rural | never smoked | No Stroke | 3.34775 | 0.55017 |
| 18498 | Small | 49.8 | 29.27036 | 20.52964 | 44 | 103.78 | Private | Rural | formerly smoked | No Stroke | 3.37658 | 0.53144 |
It appears in the table above that the greatest errors for both models involved an underestimation of the BMI value, and that both models most severely mispredicted the BMI of a patient living in a rural area and working for a private company, whose actual BMI was close to 49 and whose age was below 45. Neither of these patients had suffered a stroke.
10.4 Validated R squared values
The following code is present for the purpose of calculating
validated R squared values for the Big_model and the
Small_model using the test data.
#Big model:
cor(aug_big_model$bmi, aug_big_model$bmi_fit)^2
[1] 0.04260902
#Small model:
cor(aug_small_model$bmi, aug_small_model$bmi_fit)^2
[1] 0.04210552
| Model | R-Squared value |
|---|---|
| Big Model | 0.04260902 |
| Small Model | 0.04210552 |
It appears that around 4% of the variation in BMI within the Stroke_Analysis_Test data is explained by each of the models, though a slightly greater percentage of the variation is explained by the big model. This, in conjunction with its slightly smaller mean absolute prediction error (MAPE), shows the big model to be a (very marginally) better predictor of BMI.
10.5 Comparing the Models
Although the models were similar in their ability to predict values
for Stroke_Analysis_Test, it appears that the
Big_model is the better predictive model. This is because it
exhibited a smaller mean absolute prediction error than the
Small_model as well as an R squared value that was greater
than that of the Small_model. These indicate that the
Big_model was better able to predict the BMI value of a
patient included in the Stroke_Analysis_Test data set and
that the variation in BMI is better explained by this model.
11 Discussion
The Chosen Model
It was decided that the Big_model should be the chosen
model for this study. This is because it explained more of the
variation in the training sample (Stroke_Analysis_Train) and was
better able to predict values of BMI for the test sample
(Stroke_Analysis_Test). While the Small_model had slightly
smaller AIC and BIC values in the in-sample comparison, the
Big_model had a larger R squared value, and, more
importantly, it was found to have a smaller MAPE
value than the Small_model when it was used to predict the
values of Stroke_Analysis_Test, showing that it had the
ability to predict BMI with less error. Finally, the
Big_model had a greater validated R squared value than the
Small_model, demonstrating that it was able to explain a
larger amount of the variation in bmi in the
Stroke_Analysis_Test data. Importantly, the
Big_model also did not appear to violate any assumptions for
linear regression, as the residual plots suggested linearity,
independence, constant variance, and normality. Had it violated any
of these key assumptions, its utility as a linear model would be
severely compromised.
Answering the research question
The research question for this analysis was:
How effectively can the body mass index of a hospital patient who has been evaluated for a stroke be predicted by the average blood glucose level of the patient that was recorded in their EHR and how is the quality of prediction changed when adjusting for age, employment type, residence type, stroke history, and smoking status?
To answer this question, both the Small_model and the
Big_model need to be taken into consideration, as the
Small_model uses only the average blood glucose
level as a predictor while the Big_model adjusts for age,
employment type, residence type, stroke history, and smoking status. In
the in-sample comparison, the Big_model was found to have
been fit more closely to the values in the training data set than the
Small_model, demonstrating that the model that was built
with adjustment made for age, employment type, residence type, stroke
history, and smoking status had smaller residual values (between
predicted and observed BMI values) for the data it is built on than one
that only uses the average blood glucose level as a predictor. This
suggests that there is a possibility that BMI is able to be more
effectively predicted when adjustments for age, employment type,
residence type, stroke history, and smoking status are made, but further
testing is required to make that statement confidently. For this reason,
a model validation study was completed, in which the
Big_model and the Small_model were compared in
their ability to predict BMI values using data that they were not
trained on.
The model validation study applied each of the models to an
unfamiliar data set of 986 new observations (named
Stroke_Analysis_Test). Fitted and residual values for each
model were calculated and compared using numeric summaries and plots. It
was observed that the mean absolute prediction error for the model which
adjusted for age, employment type, residence type, stroke history, and
smoking status was lower than the MAPE for the model that only
incorporated average blood glucose level as a predictor for BMI.
Furthermore, it was observed that the model in which age, employment
type, residence type, stroke history, and smoking status were adjusted
for had a higher R squared value than the model that only incorporated
average blood glucose level as a predictor for BMI. Both of these
observations further substantiated the previously mentioned possibility
that the quality of prediction of BMI increased when average blood
glucose level was used as a predictor and adjustments are made
for age, employment type, residence type, stroke history, and smoking
status.
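The out-of-sample comparison described above can be sketched as follows. This is a hypothetical sketch: it assumes Stroke_Analysis_Test holds the 986 held-out observations with a numeric bmi column, and that both models were fit to the natural log of BMI (as the coefficient interpretation below implies), so predictions are back-transformed with exp().

```r
library(dplyr)

# Compute validation metrics for one model on held-out data.
validate <- function(model, data) {
  preds <- exp(predict(model, newdata = data))  # back-transform log(BMI)
  res   <- data$bmi - preds                     # prediction errors
  tibble(
    MAPE  = mean(abs(res)),        # mean absolute prediction error
    RMSPE = sqrt(mean(res^2)),     # root mean squared prediction error
    R2    = cor(data$bmi, preds)^2 # validated R^2 (squared correlation)
  )
}

bind_rows(
  validate(Small_model, Stroke_Analysis_Test) |> mutate(model = "Small_model"),
  validate(Big_model,   Stroke_Analysis_Test) |> mutate(model = "Big_model")
)
```

A lower MAPE and higher validated R² for the Big_model correspond to the results reported above.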
While it appeared that the quality of BMI prediction increased with the adjustments mentioned above, it is important to discuss the quality of prediction in general for both models. In the Small_model, where only average blood glucose was used as a predictor, average blood glucose demonstrated a very weak positive association with BMI, with an expected increase in the natural log of BMI of 0.00062 for every additional 1 mg/dl of average blood glucose. This means that the smaller model predicts that an extremely large increase or decrease in average blood glucose is required to create even a modest change in BMI. Given this small demonstrated influence, it cannot be stated confidently that average blood glucose as recorded in a patient's EHR is an effective predictor of that patient's BMI.
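To put the 0.00062 coefficient in context, a short worked calculation (using the coefficient reported above; the glucose changes are illustrative):

```r
# With log(BMI) modeled as ... + 0.00062 * avg_glucose, a change of
# d mg/dl in average blood glucose multiplies predicted BMI by
# exp(0.00062 * d).
b <- 0.00062
exp(b * 50)   # +50 mg/dl -> BMI multiplied by ~1.031 (about +3.1%)
exp(b * 100)  # +100 mg/dl -> BMI multiplied by ~1.064 (about +6.4%)
```

Even a 100 mg/dl swing in average blood glucose, a clinically dramatic difference, shifts the predicted BMI by only a few percent.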
In the Big_model, none of the coefficients for the variables used in the linear model had a value greater than 0.0147, indicating that only a very large change (in many cases an unrealistically large change) in one of the quantitative variables would be expected to produce a large change in the predicted BMI value of a patient. Importantly, a number of the coefficients for the variables used in this model had 90% confidence intervals that contained zero, indicating that the sample size of the data set used to create the model was not large enough to establish a positive or negative association between these variables and BMI at an alpha level of 0.10. None of the categorical variables were demonstrated to change the predicted BMI value of a patient by enough to alter their status as underweight, healthy, overweight, or obese as per CDC guidance. For this reason, it can be said that although a marginally better quality of prediction is observed when BMI is predicted using average blood glucose level with adjustments for age, employment type, residence type, stroke history, and smoking status, the quality of prediction is still very low. R squared values in the validation study demonstrated that the Big_model and the Small_model were each only able to explain around 4% of the variation in BMI within the Stroke_Analysis_Test data set.
Given each of the points made above, it appears that the research question is best answered as follows:
The body mass index of a hospital patient who has been evaluated for a stroke cannot be effectively predicted by the average blood glucose level that was recorded in their EHR. However, slightly more accurate predictions of BMI for hospital patients who have been evaluated for stroke can be calculated by using average blood glucose level of the patient as a predictor and adjusting for age, employment type, residence type, stroke history, and smoking status of the patient.
My pre-analysis expectations were that the body mass index of a hospital patient who has been evaluated for stroke status could be effectively predicted by the average blood glucose level of that patient, and that adjusting for age, employment type, residence type, stroke history, and smoking status would improve the predictive quality. Although I had correctly anticipated that the inclusion of additional variables would improve the predictive quality, the weak predictive ability of the average blood glucose level was not at all expected.
There are a number of limitations to this study. First, the sample size used to train the models was limited, which in turn limited the ability of both models to reflect the true distribution of data for patients who have been evaluated for stroke. Second, the models created were strictly linear, and thus it is possible that although average blood glucose cannot be used in a linear model to predict BMI adequately, it may be a useful predictor when incorporated into a different type of regression model (such as a quadratic or cubic model).
Next Steps
In looking to create a prediction model for BMI using electronic health record data, there are a number of steps that may be taken in the future to create a more comprehensive model with a higher quality of BMI prediction. One item to consider is the number of observations used to train the model. Because the use of electronic health record systems in 2022 is fairly ubiquitous, it is not unrealistic to consider that data from far more than 2299 patients could be used to create a predictive model for BMI. Additionally, more variables may be considered in making this model, specifically those that are known to influence a person's body mass, such as participation in exercise or physical disability.
Additionally, one may consider an approach to predicting BMI that does not rely on a linear model, but rather uses quadratic or cubic models to account for the possibility that different variables exert different effects on a patient's BMI depending on where that patient lies on the distribution of BMI values.
Finally, further studies looking to establish a predictive model for BMI may consider looking beyond only patients who have been evaluated for a stroke as per their electronic health records. This may involve looking at populations that have been evaluated for other diseases or looking at a broader spectrum of hospital patients with less specific constraints. It is possible that people who have been evaluated for a stroke do not have particularly strong correlations between BMI and the variables examined in this study, but that other populations do.
Reflection
Reflecting on my approach to this study knowing what I do now, there are a few things that, had I known them at the beginning of the study, I would have done differently.
Before I completed this project, I was not aware of how I could
display the confidence intervals for the coefficients of a linear model
created using ggplot2. In completing this project I found
it to be an incredibly useful way to visualize the influence of
variables in a linear model and a useful tool for explaining an
equation. If I had understood how to do this at the beginning of the
project I would have created a plot of coefficients and confidence
intervals before writing any explanation of my model so that I could
have more quickly and effectively described its equation and anticipated
the effects of each of its variables.
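A coefficient-and-interval plot of the kind described above can be sketched with broom and ggplot2. The model name Big_model is taken from the text; the plot code is an illustrative sketch rather than the exact figure from the project.

```r
library(broom)
library(dplyr)
library(ggplot2)

# Tidy the fitted model with 90% confidence intervals, drop the
# intercept, and draw each coefficient as a point with a horizontal
# interval; a dashed line at zero flags intervals that contain zero.
tidy(Big_model, conf.int = TRUE, conf.level = 0.90) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  labs(x = "Coefficient estimate (90% CI)", y = NULL)
```

Coefficients whose intervals cross the dashed zero line are exactly those discussed above as not establishing a direction of association at the 0.10 alpha level.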
Additionally, when I began working on this project, I did not have a strong understanding of the car package and how it can be used to analyze the Variance Inflation Factor in table form to assess collinearity in a multiple linear regression model. When I began the project, I was planning on only using Pearson correlations and scatterplot matrices to analyze collinearity; however, had I been better acquainted with the vif function, I might have chosen additional variables for my big model, as I would have been able to examine more closely whether there was collinearity between any of my predictors.
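The VIF check mentioned above amounts to a single call, sketched here for the fitted Big_model (the model name comes from the text):

```r
library(car)

# Variance Inflation Factors for each predictor in the fitted model.
# For models with categorical predictors, vif() reports generalized
# VIFs; values above roughly 5 are commonly taken to suggest
# problematic collinearity.
car::vif(Big_model)
```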
Session Information
sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os macOS Big Sur ... 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2022-12-19
pandoc 2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
─ Packages ─────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0)
bookdown 0.30 2022-11-09 [1] CRAN (R 4.2.0)
broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)
bslib 0.4.1 2022-11-02 [1] CRAN (R 4.2.0)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
car * 3.1-1 2022-10-19 [1] CRAN (R 4.2.0)
carData * 3.0-5 2022-01-06 [1] CRAN (R 4.2.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
checkmate 2.1.0 2022-04-21 [1] CRAN (R 4.2.0)
cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.0)
cluster 2.1.4 2022-08-22 [1] CRAN (R 4.2.0)
cmprsk 2.2-11 2022-01-06 [1] CRAN (R 4.2.0)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
data.table 1.14.6 2022-11-16 [1] CRAN (R 4.2.0)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.0)
deldir 1.0-6 2021-10-23 [1] CRAN (R 4.2.0)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.1)
distributional 0.3.1 2022-09-02 [1] CRAN (R 4.2.0)
dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
Epi * 2.47 2022-06-26 [1] CRAN (R 4.2.0)
equatiomatic * 0.3.1 2022-01-30 [1] CRAN (R 4.2.0)
etm 1.1.1 2020-09-08 [1] CRAN (R 4.2.0)
evaluate 0.18 2022-11-07 [1] CRAN (R 4.2.0)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.0)
foreign 0.8-83 2022-09-28 [1] CRAN (R 4.2.0)
Formula 1.2-4 2020-10-16 [1] CRAN (R 4.2.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
gargle 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
GGally * 2.1.2 2021-06-21 [1] CRAN (R 4.2.0)
ggdist * 3.2.0 2022-07-19 [1] CRAN (R 4.2.0)
ggforce * 0.4.1 2022-10-04 [1] CRAN (R 4.2.0)
ggformula * 0.10.2 2022-09-01 [1] CRAN (R 4.2.0)
gghalves * 0.1.4 2022-11-20 [1] CRAN (R 4.2.0)
ggmosaic * 0.3.4 2022-12-10 [1] Github (haleyjeppson/ggmosaic@fb42e7b)
ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)
ggrepel 0.9.2 2022-11-06 [1] CRAN (R 4.2.0)
ggridges 0.5.4 2022-09-26 [1] CRAN (R 4.2.0)
ggstance * 0.3.5 2020-12-17 [1] CRAN (R 4.2.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0)
googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0)
gower 1.0.0 2022-02-03 [1] CRAN (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0)
highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
Hmisc 4.7-1 2022-08-15 [1] CRAN (R 4.2.0)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
htmlTable 2.4.1 2022-07-07 [1] CRAN (R 4.2.0)
htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.0)
httpuv 1.6.6 2022-09-08 [1] CRAN (R 4.2.0)
httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
interp 1.1-3 2022-07-13 [1] CRAN (R 4.2.0)
janitor * 2.1.0 2021-01-05 [1] CRAN (R 4.2.0)
jpeg 0.1-9 2021-07-24 [1] CRAN (R 4.2.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
kableExtra * 1.3.4 2021-02-20 [1] CRAN (R 4.2.0)
knitr * 1.41 2022-11-18 [1] CRAN (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
labelled 2.10.0 2022-09-14 [1] CRAN (R 4.2.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
lattice * 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
latticeExtra 0.6-30 2022-07-04 [1] CRAN (R 4.2.0)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.2.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
lubridate 1.9.0 2022-11-06 [1] CRAN (R 4.2.0)
magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0)
Matrix * 1.4-1 2022-03-23 [1] CRAN (R 4.2.1)
mgcv 1.8-41 2022-10-21 [1] CRAN (R 4.2.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.1)
modelsummary * 1.2.0 2022-11-26 [1] CRAN (R 4.2.0)
mosaic * 1.8.4.2 2022-09-20 [1] CRAN (R 4.2.0)
mosaicCore 0.9.2.1 2022-09-22 [1] CRAN (R 4.2.0)
mosaicData * 0.20.3 2022-09-01 [1] CRAN (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
naniar * 0.6.1 2021-05-14 [1] CRAN (R 4.2.0)
nlme 3.1-160 2022-10-10 [1] CRAN (R 4.2.0)
nnet 7.3-18 2022-09-28 [1] CRAN (R 4.2.0)
numDeriv 2016.8-1.1 2019-06-06 [1] CRAN (R 4.2.0)
patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
plotly 4.10.1 2022-11-07 [1] CRAN (R 4.2.0)
plyr 1.8.8 2022-11-11 [1] CRAN (R 4.2.1)
png 0.1-7 2013-12-03 [1] CRAN (R 4.2.0)
polyclip 1.10-4 2022-10-20 [1] CRAN (R 4.2.0)
productplots 0.1.1 2016-07-02 [1] CRAN (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0)
readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0)
reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
reshape 0.8.9 2022-04-12 [1] CRAN (R 4.2.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
rmarkdown * 2.18 2022-11-09 [1] CRAN (R 4.2.0)
rmdformats * 1.0.4 2022-05-17 [1] CRAN (R 4.2.0)
rpart 4.1.19 2022-10-21 [1] CRAN (R 4.2.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
sass 0.4.2 2022-07-16 [1] CRAN (R 4.2.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
shiny 1.7.3 2022-10-25 [1] CRAN (R 4.2.0)
simputation * 0.2.8 2022-06-16 [1] CRAN (R 4.2.0)
snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.2.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
survival 3.4-0 2022-08-09 [1] CRAN (R 4.2.0)
svglite 2.1.0 2022-02-03 [1] CRAN (R 4.2.0)
systemfonts 1.0.4 2022-02-11 [1] CRAN (R 4.2.0)
tables 0.9.10 2022-10-17 [1] CRAN (R 4.2.0)
tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
tidyr * 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0)
timechange 0.1.1 2022-11-04 [1] CRAN (R 4.2.0)
tweenr 2.0.2 2022-09-06 [1] CRAN (R 4.2.0)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.0)
viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
visdat 0.5.3 2019-02-15 [1] CRAN (R 4.2.0)
webshot 0.5.4 2022-09-26 [1] CRAN (R 4.2.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
xfun 0.35 2022-11-16 [1] CRAN (R 4.2.0)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
zoo 1.8-11 2022-09-17 [1] CRAN (R 4.2.0)
[1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
────────────────────────────────────────────────────────────────────